62 research outputs found

    An automated ETL for online datasets

    While using online datasets for machine learning is commonplace today, the quality of these datasets affects the performance of prediction algorithms. One method for improving the semantics of new data sources is to map them to a common data model or ontology. While semantic and structural heterogeneities must still be resolved, this provides a well-established approach to producing clean datasets suitable for machine learning and analysis. However, when online data must be used in close to real time, a method for dynamic Extract-Transform-Load (ETL) of new source data is needed. In this work, we present a framework for integrating online and enterprise data sources, in close to real time, to provide datasets for machine learning and predictive algorithms. An exhaustive evaluation compares a human-built data transformation process with our system’s machine-generated ETL process, with very favourable results, illustrating the value and impact of an automated approach.
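
    The following minimal sketch (not the authors' system) illustrates the underlying idea of mapping a heterogeneous online source onto a common data model before loading; the model, field names and mapping table are invented for illustration.

```python
# Minimal sketch of mapping an online source onto a common data model before
# loading. COMMON_MODEL and FIELD_MAP are hypothetical, not the paper's schema.
from datetime import datetime

# Hypothetical common data model: target field -> expected Python type
COMMON_MODEL = {"station_id": str, "observed_at": datetime, "rainfall_mm": float}

# Hypothetical per-source field mapping, configured or learned at integration time
FIELD_MAP = {"stn": "station_id", "ts": "observed_at", "rain": "rainfall_mm"}

def transform(record: dict) -> dict:
    """Rename source fields to the common model and coerce their types."""
    out = {}
    for src_key, value in record.items():
        target = FIELD_MAP.get(src_key)
        if target is None:
            continue  # drop fields the common model does not define
        expected = COMMON_MODEL[target]
        if expected is datetime and isinstance(value, str):
            value = datetime.fromisoformat(value)
        else:
            value = expected(value)
        out[target] = value
    return out

print(transform({"stn": "D04", "ts": "2023-06-01T12:00:00", "rain": "3.2"}))
```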

    Validating a novel web-based method to capture disease progression outcomes in multiple sclerosis

    The Expanded Disability Status Scale (EDSS) is the current ‘gold standard’ for monitoring disease severity in multiple sclerosis (MS). The EDSS is a physician-based assessment, so a patient-related surrogate for the EDSS may be useful for capturing information remotely. Eighty-one patients (EDSS range 0–8) undergoing EDSS assessment as part of clinical trials were recruited. All patients completed the web-based survey with minimal assistance. Full EDSS scores were available for 78 patients. The EDSS scores were compared with those generated by the online survey using analysis of variance, a matched-pair test, Pearson’s coefficient, the weighted kappa coefficient, and the intra-class correlation coefficient. The internet-based EDSS scores showed good correlation with the physician-measured assessment (Pearson’s coefficient = 0.85). Weighted kappa for full agreement was 0.647. Full agreement was observed in 20 patients.
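
    The sketch below shows how two of the agreement statistics named in the abstract (Pearson's coefficient and weighted kappa) are typically computed, here on synthetic paired EDSS scores rather than the study's data; an intra-class correlation would need an additional package such as pingouin.

```python
# Illustrative only: agreement statistics on synthetic paired EDSS scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
physician = rng.integers(0, 9, size=78)        # physician-measured EDSS, 0-8
noise = rng.integers(-1, 2, size=78)           # simulate imperfect agreement
web_based = np.clip(physician + noise, 0, 8)   # web-survey-derived EDSS

r, _ = pearsonr(physician, web_based)
kappa = cohen_kappa_score(physician, web_based, weights="linear")
full_agreement = int((physician == web_based).sum())

print(f"Pearson r = {r:.2f}, weighted kappa = {kappa:.3f}, "
      f"exact matches = {full_agreement}/78")
```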

    A method for automated transformation and validation of online datasets

    While using online datasets for machine learning is commonplace today, the quality of these datasets affects the performance of prediction algorithms. One method for improving the semantics of new data sources is to map them to a common data model or ontology. While semantic and structural heterogeneities must still be resolved, this provides a well-established approach to producing clean datasets suitable for machine learning and analysis. However, when online data must be used in close to real time, a method for dynamic Extract-Transform-Load (ETL) of new source data is needed. In this work, we present a framework for integrating online and enterprise data sources, in close to real time, to provide datasets for machine learning and predictive algorithms. An exhaustive evaluation compares a human-built data transformation process with our system’s machine-generated ETL process, with very favourable results, illustrating the value and impact of an automated approach.
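
    The validation step referred to in the title could, in outline, resemble the following sketch: transformed records are checked against required fields, types and simple range rules before loading. The target model and bounds here are purely illustrative, not the authors' specification.

```python
# Illustrative post-transformation validation pass (not the paper's validator).
REQUIRED = {"station_id": str, "rainfall_mm": float}   # hypothetical target model
RANGES = {"rainfall_mm": (0.0, 500.0)}                 # hypothetical sanity bounds

def validate(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record loads."""
    errors = []
    for field, expected_type in REQUIRED.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    for field, (lo, hi) in RANGES.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors

print(validate({"station_id": "D04", "rainfall_mm": 3.2}))   # []
print(validate({"station_id": "D04", "rainfall_mm": -1.0}))  # range error
```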

    Multi-resolution forecast aggregation for time series in agri datasets

    A wide variety of phenomena are characterized by time-dependent dynamics that can be analyzed using time series methods. Various time series analysis techniques have been presented, each addressing certain aspects of the data. In time series analysis, forecasting over extended time horizons, which effectively requires multi-step-ahead (MSA) prediction, is a challenging problem. The two original solutions to MSA are the direct and the recursive approaches. Recent studies have mainly focused on combining previous methods in an attempt to overcome the direct strategy’s discarding of sequential correlation and the recursive strategy’s accumulation of error. This paper introduces a technique known as Multi-Resolution Forecast Aggregation (MRFA), which incorporates an additional concept known as Resolutions of Impact. MRFA is shown to have favourable prediction capabilities in comparison with a number of state-of-the-art methods.
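
    For context, the two baseline strategies the abstract contrasts can be sketched as follows (MRFA itself is not reproduced here); the example uses a synthetic series and simple linear models.

```python
# Sketch of the direct vs recursive multi-step-ahead strategies on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
series = np.sin(np.arange(200) / 8.0) + 0.1 * rng.standard_normal(200)
LAGS, HORIZON = 5, 3

def make_xy(y, lags, step):
    """Lagged design matrix predicting the value `step` points ahead."""
    X = np.column_stack([y[i:len(y) - lags - step + 1 + i] for i in range(lags)])
    return X, y[lags + step - 1:]

# Recursive: one 1-step model, fed its own predictions for later steps.
X1, y1 = make_xy(series, LAGS, 1)
one_step = LinearRegression().fit(X1, y1)
window = list(series[-LAGS:])
recursive = []
for _ in range(HORIZON):
    pred = one_step.predict([window[-LAGS:]])[0]
    recursive.append(pred)
    window.append(pred)

# Direct: a separate model per horizon step, each trained on observed data only.
direct = []
for h in range(1, HORIZON + 1):
    Xh, yh = make_xy(series, LAGS, h)
    model = LinearRegression().fit(Xh, yh)
    direct.append(model.predict([series[-LAGS:]])[0])

print("recursive:", np.round(recursive, 3))
print("direct:   ", np.round(direct, 3))
```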

    Modelling piezo-spectroscopic data

    Information from the many kinds of spectroscopy used by chemists and physicists is fundamental to our understanding of the structure of materials. Numerical techniques have an important role to play in augmenting the instrumentation and technology available in the laboratory, but are frequently viewed as separate from laboratory procedures. We examine the modelling approaches currently applied in spectroscopy and determine their applicability to piezo-spectroscopic data. Typically, in piezo-spectroscopic modelling the analyses in question are required to handle large, complex secular matrices, to distinguish between components in the experimental results, and to identify the transition types as rapidly and efficiently as possible. The proposed method is based on providing a shell around the Powell or Fletcher-Reeves minimisation algorithms, and gives favourable results compared with those previously used. Additionally, the statistical properties of the least-squares estimator used in the Powell shell are examined and implications for nonlinear model functions are discussed. We also show that the least-squares estimator performs well for piezo-spectroscopic data compared with estimators currently used in multi-response data analysis. Finally, we describe the development of a software tool which incorporates all features of fitting piezo-spectroscopic data.
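
    The general pattern of wrapping a least-squares objective in a shell around a derivative-free minimiser such as Powell's method can be sketched as below, on a synthetic single-peak spectrum; this is illustrative only and not the software described in the abstract.

```python
# Illustrative shell around Powell's method for a least-squares spectral fit
# (synthetic single Lorentzian peak); not the software described above.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(-10, 10, 400)

def lorentzian(x, amp, centre, width):
    return amp * width**2 / ((x - centre) ** 2 + width**2)

observed = lorentzian(x, 5.0, 1.5, 2.0) + 0.1 * rng.standard_normal(x.size)

def sum_of_squares(params):
    """Least-squares objective handed to the derivative-free minimiser."""
    amp, centre, width = params
    return np.sum((observed - lorentzian(x, amp, centre, width)) ** 2)

result = minimize(sum_of_squares, x0=[1.0, 0.0, 1.0], method="Powell")
print("fitted (amp, centre, width):", np.round(result.x, 3))
```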

    Variable interactions in risk factors for dementia

    Current estimates predict that 1 in 3 people born today will develop dementia, suggesting a major impact on future population health. As such, research needs to connect specialist clinicians, data scientists and the general public. The In-MINDD project seeks to address this through the provision of a Profiler, a socio-technical information system connecting all three groups. The public interact with it, providing raw data; data scientists develop and refine prediction algorithms; and clinicians use in-built services to inform decisions. Common across these groups are Risk Factors, used for dementia-free survival prediction. Interactions between risk factors could greatly inform prediction, but determining these interactions is a problem underpinned by a massive number of possible combinations. Our research employs a machine learning approach to automatically select the best-performing hyperparameters for prediction and learns variable interactions in a non-linear survival-analysis paradigm. To demonstrate its effectiveness, we evaluate this approach using longitudinal data with a relatively small sample size.
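
    Automated hyperparameter selection of the kind described can be illustrated, in simplified form, by a randomised search over a classifier predicting a binary outcome at a fixed horizon; the In-MINDD work uses a full survival-analysis setting, and the data below is synthetic.

```python
# Simplified illustration of automated hyperparameter selection via random
# search with cross-validation, on synthetic risk-factor data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 10))                  # synthetic risk-factor matrix
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)   # outcome with an interaction

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))
```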

    An architecture and services for constructing data marts from online data sources

    The Agri sector has shown exponential growth in both the demand for data and its production and availability. In parallel with this growth, Agri organisations often need to integrate their in-house data with international, web-based datasets. Generally, data is freely available from official government sources, but there is very little uniformity between sources, often leading to significant manual overhead in the development of data integration systems and the preparation of reports. While this has led to increased use of data warehousing technology in the Agri sector, the costs, in terms of both the time to access data and the financial cost of generating the Extract-Transform-Load layers, remain high. In this work, we examine more lightweight data marts within an infrastructure that can support on-demand queries. We focus on the construction of data marts which combine both enterprise and web data, and present an evaluation which verifies the transformation process from source to data mart.
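
    A lightweight data mart combining enterprise and web data can be sketched as a simple join of the two sources into a query-ready table; the column names and figures below are invented for illustration and do not come from the paper.

```python
# Minimal sketch of a lightweight data mart joining an enterprise table with a
# web-sourced dataset; all names and values are invented for illustration.
import pandas as pd

# In-house (enterprise) herd records -- would normally come from an internal DB.
enterprise = pd.DataFrame({
    "county": ["Cork", "Galway", "Meath"],
    "herd_size": [120, 85, 240],
})

# Web-based dataset -- would normally be fetched from an official open-data API.
web = pd.DataFrame({
    "county": ["Cork", "Galway", "Meath"],
    "avg_rainfall_mm": [1200, 1400, 900],
})

# The "mart": a joined, query-ready table supporting on-demand reporting.
mart = enterprise.merge(web, on="county")
print(mart.query("avg_rainfall_mm > 1000")[["county", "herd_size"]])
```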